A Length-variable Feature Code Based Fuzzy Duplicates Elimination Approach for Large Scale Chinese WebPages

Authors

  • Hongzhi Guo
  • Qingcai Chen
  • Cong Xin
  • Xiaolong Wang
Abstract

Most existing duplicate elimination approaches for Chinese webpages do not address noisy or fuzzy duplicates. In this paper, we propose an efficient and noise-tolerant duplicate elimination approach for Chinese webpages based on a Length-variable Feature Code. First, an Independent Extraction Unit is defined to eliminate the impact of short paragraphs on feature code extraction. Then, the concept of repeatability, based on the longest common substring, is introduced to improve noise tolerance. Experimental results on a dataset of 10 million webpages show that the proposed approach handles duplicates in large webpage collections efficiently, with a duplicate elimination precision of 99.03%.
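The abstract does not define repeatability precisely, so the following is only a minimal sketch of how a longest-common-substring score between two feature codes might be computed. The function names, the normalization by the shorter code's length, and the idea of comparing the score against a threshold are all assumptions made for illustration, not the authors' method.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return one longest contiguous substring shared by a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)  # match lengths for the previous row of the DP table
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def repeatability(code_a: str, code_b: str) -> float:
    """Hypothetical score: LCS length divided by the shorter code's length."""
    shorter = min(len(code_a), len(code_b))
    if shorter == 0:
        return 0.0
    return len(longest_common_substring(code_a, code_b)) / shorter
```

Under these assumptions, two pages whose feature codes differ only by a few inserted noise characters would still share a long common substring and score close to 1.0, while unrelated pages would score near 0.0. Any cut-off used to declare a duplicate would have to be tuned on real data.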

Similar resources

Fuzzy-rough Information Gain Ratio Approach to Filter-wrapper Feature Selection

Feature selection for various applications has been carried out for many years in many different research areas. However, there is a trade-off between finding feature subsets with minimum length and increasing the classification accuracy. In this paper, a filter-wrapper feature selection approach based on fuzzy-rough gain ratio is proposed to tackle this problem. As a search strategy, a modifie...

A Novel Architecture for Detecting Phishing Webpages using Cost-based Feature Selection

Phishing is one of the luring techniques used to exploit personal information. A phishing webpage detection system (PWDS) extracts features to determine whether it is a phishing webpage or not. Selecting appropriate features improves the performance of PWDS. Performance criteria are detection accuracy and system response time. The major time consumed by PWDS arises from feature extraction that ...

AN OBSERVER-BASED INTELLIGENT DECENTRALIZED VARIABLE STRUCTURE CONTROLLER FOR NONLINEAR NON-CANONICAL NON-AFFINE LARGE SCALE SYSTEMS

In this paper, an observer based fuzzy adaptive controller (FAC) is designed for a class of large scale systems with non-canonical non-affine nonlinear subsystems. It is assumed that functions of the subsystems and the interactions among subsystems are unknown. By constructing a new class of state observer for each follower, the proposed consensus control method solves the problem of unmeasured sta...

A Variable Structure Observer Based Control Design for a Class of Large scale MIMO Nonlinear Systems

This paper discusses in full how to design an observer based decentralized fuzzy adaptive controller for a class of large scale multivariable non-canonical nonlinear systems with unknown functions of subsystems’ states. On-line tuning mechanisms to adjust both the parameters of the direct adaptive controller and observer that guarantee the ultimate boundedness of both the tracking error and tha...

A 2-D Numerical Investigation on the Modal Characteristics of Rotating-Stall with a Variable-Cascade-Length Approach in an Axial Compressor

The Rotating-Stall (RS) through a rotor-cascade of an axial compressor was numerically investigated with an unsteady two-dimensional finite-volume density based computer code. To validate the computer code, a test case was prepared, and the good agreement of the compared results gave adequate assurance of the code. The RS was incepted with a 40% reduction in flow coefficient and a 0.4%...

Journal title:
  • JSW

Volume 7  Issue 

Pages  -

Publication date 2012